Trends in Multiprocessor and Distributed Operating System Designs
ion of lightweight processes bound to an object's virtual address-space. In the Chorus system, an actor represents a protection domain and is the unit of resource allocation; actors are multithreaded. Communication between threads belonging to different tasks takes place using ports. The protection domains in Amoeba, Cronus, and Nexus are defined in terms of an object manager, which is a heavyweight process that manages a collection of objects having the same functional abstraction. Such a manager process may be multithreaded to service multiple requests concurrently.

Thread Usage Paradigms: There are a number of common usages of the thread construct. Hauser et al. [1993] have described different paradigms of thread usage. Some of the important usages include exploiting parallelism, defer work, deadlock avoidance, pumps, slack processes, and rejuvenation. The use of threads for exploiting parallelism on multiprocessors is the most obvious in this list. In distributed systems as well, threads can be run in quasi-parallel, allowing programmers to use the thread model to structure their applications. In the defer-work usage, a server can create a thread to perform some background activities whose completion is not required before returning a response to the client. This can reduce the response time seen by the clients. The use of threads as pumps or slack processes occurs in pipelined computations. A pump picks data from a pipeline, processes it, and puts it into another pipeline. A thread is used as a slack process in a pipeline to perform preprocessing of data in the pipeline to "hopefully" reduce work for the server; for example, replacing older data with newer values. Threads can be created dynamically to service incoming calls and thereby avoid deadlocks among communicating activities. Rejuvenation use occurs in situations of exceptions; in such a case a corrupted thread can be discarded and a new one created in its place.

User-Level and Kernel-Level Threads: Several thread management libraries such as FastThreads [Anderson et al. 1989] and PCR [Weiser et al. 1989] were developed in the late 1980s to support user-level threads above traditional OS kernels that did not support threads. All scheduling and synchronization mechanisms were implemented at the user level, without involving the kernel. In contrast to kernel-implemented threads, user-level threads are cheap to create, schedule, context switch, and manage. This is because the overhead of crossing protection domains when making kernel calls is eliminated. Such threads do not burden kernel resources. An application can be structured using a potentially large number of such threads. However, one major drawback arises when one thread makes a blocking system call; this prevents all of the other threads from getting CPU cycles. Also, in a multiprocessor environment one cannot exploit parallelism by executing such threads in parallel on different processors. On the other hand, kernel-level threads do not suffer from these problems, but they are roughly an order of magnitude more expensive in terms of creation and context switch times [Anderson et al. 1991]. The above observations regarding the advantages as well as the limitations of user-level threads motivated several system designs to support both kernel-level and user-level threads [Edler et al. 1988] [Marsh et al. 1991] [Anderson et al. 1991]. In general, in these designs the kernel provides event notifications to the user-level thread management.
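The defer-work usage described above is easy to picture with ordinary POSIX threads. The following sketch is purely illustrative; send_reply() and append_to_log() are hypothetical placeholders rather than routines from any of the systems surveyed here.

    /* Illustrative sketch of the defer-work paradigm: the server answers the
     * client first and hands the non-urgent work to a detached background
     * thread. send_reply() and append_to_log() are hypothetical placeholders. */
    #include <pthread.h>
    #include <stdlib.h>

    struct request { int client_fd; char payload[256]; };

    extern void send_reply(int client_fd);         /* assumed server routine */
    extern void append_to_log(struct request *r);  /* assumed background work */

    static void *deferred_work(void *arg)
    {
        struct request *req = arg;
        append_to_log(req);        /* completion not needed for the reply */
        free(req);
        return NULL;
    }

    void serve(struct request *req)
    {
        pthread_t tid;
        pthread_attr_t attr;

        send_reply(req->client_fd);     /* client sees a short response time */

        /* Defer the logging to a detached thread so serve() returns at once. */
        pthread_attr_init(&attr);
        pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
        pthread_create(&tid, &attr, deferred_work, req);
        pthread_attr_destroy(&attr);
    }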
In Psyche the term thread refers to a user-level thread, and the term virtual processor refers to a kernel-level thread [Marsh et al. 1991]. Multiple virtual processors can execute, possibly in parallel on different CPUs, in a shared address space. The general goal is to support a large number of threads with a relatively small number of virtual processors, thus not burdening kernel resources excessively. A thread can be bound to any of the virtual processors in its address space, and this binding can change at runtime. Psyche adopts a novel approach to make user-level threads first-class objects, in the sense that a blocking system call by a user-level thread does not prevent other threads from executing. Its approach is based on the delivery of software interrupts from the kernel to the virtual processor when the thread executing on it makes a blocking call and also when such a call completes. The user-level interrupt handlers cause context switching of the virtual processor to another runnable thread. Psyche uses data structures shared between a virtual processor and the kernel for efficient asynchronous communication. Another interesting feature of the Psyche approach is its flexibility in supporting different kinds of thread execution models. These models can differ in terms of scheduling policies, synchronization mechanisms, thread creation/termination models, etc.

The Symunix design approach is to extend the conventional UNIX environment for large-scale parallel machines [Edler et al. 1988]. To support user-level thread management it introduces a meta-system call to provide an asynchronous interface for making any system call. It also relies on kernel-to-user-space signals for call completion notifications.

Anderson et al. developed a user-level thread management facility on the DEC Firefly multiprocessor workstation [Anderson et al. 1991]. In their design, a scheduler activation represents a virtual processor. An activation serves three purposes. Similar to a kernel thread, it is a virtual processor for executing user threads. It is also used for communicating events from the kernel to the user level, and for saving the context of a user thread blocked in the kernel. When a thread blocks, a new activation is given to the user level to notify it of this event. This new activation can be used to execute another thread. When a thread unblocks in the kernel, it is not resumed directly; instead a new activation is given by the kernel to the user level with an appropriate notification.

SunOS 5.0 provides a very similar two-level thread model [Powell et al. 1991] [Stein and Shah 1992]. Threads are implemented at the user level by a library package. An application can be structured as a collection of a large number of threads. The kernel level defines a set of lightweight processes (LWPs), which can be viewed as virtual processors. LWPs are managed and scheduled by the kernel and can potentially run in parallel on a multiprocessor machine. A UNIX process can consist of some number of LWPs and threads. A thread needs an LWP for execution. The thread library can cause an LWP to switch from one thread to another. When a thread makes a kernel call, it remains bound to the lightweight process executing it. If the thread gets blocked because of its synchronization with other threads, its LWP can pick another thread for execution. In this model it is also possible to permanently bind a thread to an LWP. The number of LWPs can be controlled by the programmer at runtime.
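POSIX threads later standardized a comparable two-level notion as contention scope. The sketch below is only a rough modern analogue of bound and unbound threads, not the SunOS 5.0 library interface itself; whether process scope is actually multiplexed over fewer kernel entities depends on the implementation.

    /* Rough analogue of bound vs. unbound threads using POSIX contention
     * scope; support for PTHREAD_SCOPE_PROCESS varies across implementations. */
    #include <pthread.h>

    extern void *worker(void *arg);   /* assumed thread body */

    void spawn_workers(void)
    {
        pthread_attr_t attr;
        pthread_t bound, unbound;

        pthread_attr_init(&attr);

        /* "Bound" flavour: one kernel-scheduled entity per thread, much like a
         * thread permanently bound to its own LWP. */
        pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
        pthread_create(&bound, &attr, worker, NULL);

        /* "Unbound" flavour: multiplexed by the user-level library onto a pool
         * of kernel entities, where the implementation supports it. */
        pthread_attr_setscope(&attr, PTHREAD_SCOPE_PROCESS);
        pthread_create(&unbound, &attr, worker, NULL);

        pthread_attr_destroy(&attr);
    }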
The thread library can also dynamically adjust the number of LWPs to avoid deadlocks. A thread making a system call remains bound to its LWP until the call completes. When all the LWPs in a process are blocked in indefinite waits in the kernel, the kernel sends a signal to the thread library. This results in the creation of a new LWP if there are threads waiting to be executed. The fork primitive for process creation has been modified in this system. A process can fork another process in two ways. In the first, the entire process is cloned along with all of its threads. Alternatively, the new process can be created with only the calling thread.

3.2 Process Synchronization Mechanisms

Process synchronization is achieved by using interprocess communication mechanisms based on either shared memory or message passing. Of these, shared memory is more efficient than message passing. We first examine the various approaches for synchronization in shared memory based multiprocessor operating systems. Next we look at synchronization mechanisms in message-based systems.

Spinning vs. Blocking: The classical approaches to synchronization have been based on mechanisms such as condition variables, boolean flags, semaphores, or monitors. In uniprocessor systems busy-waiting was considered undesirable because it introduces the possibility of deadlocks and wastes processor cycles. Implementation of critical sections is the most fundamental synchronization problem. In uniprocessor systems the most common way of implementing critical sections is based on disabling interrupts. This approach obviously does not extend to multiprocessor systems. Also, in such architectures the cost of suspending a thread (i.e., making the thread sleep) and restarting it later when contending to enter a short critical section may be significantly higher than the cost of keeping it busy-waiting. When no threads are waiting for a processor, the lock-requesting thread can be kept spinning. If the thread holding the lock is waiting for a processor, then spinning by the lock-requesting thread is useless; it should give up its processor to the thread holding the lock. The issue to consider here is how long a thread should spin before blocking, when the thread holding the lock is running and there are other threads waiting for a processor. A detailed study of spinning vs. blocking is presented in [Karlin et al. 1991]. A scheme in which the time spent in spinning is equal to the context switch time is called competitive. The empirical study by Karlin et al. shows that adaptive competitive schemes that take into account the waiting time experienced during past lock acquisition requests perform better than non-adaptive algorithms. Some system designs have introduced the concept of advisable processor locks (APL) [Campbell et al. 1991], in which each lock records the amount of time it would be held. Based on this information, other contending processes can decide whether they should spin or sleep. A similar approach can be seen in the SunOS 5.0 kernel [Eykholt et al. 1992], which allows a sleep or spin option to be specified as part of the initialization of a synchronization variable.

Design Alternatives for Spin-Locks: Spin-locks are used extensively in shared memory multiprocessor operating systems for implementing short critical sections. They are also used as building blocks for implementing long-term coarse-grain locks or semaphores [Schimmel 1994].
Even when one is using UNIX's sleep/wakeup mechanism for synchronization in a multiprocessor environment, short-term critical sections are needed to avoid race conditions when one process executes the sleep operation after testing a condition and another process concurrently executes the wakeup operation. Spinning by a processor can potentially cause a significant level of bus/network traffic and memory traffic. It can lower the performance of the processor holding the lock. In a cache-coherent shared memory multiprocessor, a spinning processor executing the Test-and-Set instruction can cause significant performance degradation due to invalidation signals on the bus. Other schemes that adopt a snooping approach using Test-and-Test-and-Set improve performance only marginally, because on each lock release a significant amount of bus traffic can be generated by the spinning processors. In the context of shared memory multiprocessors, various design options for spin-lock implementation were independently studied by Anderson [1990], and Graunke and Thakkar [1990]. These include various schemes for inserting delays between successive attempts by a process to gain the lock. Algorithms using Test-and-Set or Test-and-Test-and-Set with exponential delays perform better than those without delays. They also concluded that, in general, queueing-based schemes in which each processor spins on a separate flag (according to its position in the queue) perform better than most other schemes. Spin-lock based mutual exclusion algorithms for distributed memory architectures are discussed by Mellor-Crummey and Scott [1991]. The basic approach to designing efficient algorithms for such architectures is that a process should spin on local flags as much as possible to reduce network traffic.
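These alternatives can be combined into a single sketch. The code below (C11 atomics, arbitrary tuning constants) is illustrative only: it spins with test-and-test-and-set and exponential backoff, and after a bounded amount of spinning gives up the processor, a simple stand-in for the spin-versus-block decision discussed earlier.

    /* Illustrative spin-lock combining the alternatives discussed above:
     * test-and-test-and-set, exponential backoff, and a bounded spin after
     * which the thread yields the processor rather than spinning forever.
     * The tuning constants are arbitrary assumptions. */
    #include <stdatomic.h>
    #include <sched.h>

    typedef struct { atomic_int held; } spinlock;

    enum { MAX_BACKOFF = 1024, SPIN_BUDGET = 4096 };

    void spin_lock(spinlock *l)
    {
        unsigned backoff = 1, spins = 0;

        for (;;) {
            /* Test: a read-only spin stays in the local cache and avoids the
             * bus traffic caused by repeated Test-and-Set instructions. */
            while (atomic_load_explicit(&l->held, memory_order_relaxed)) {
                for (volatile unsigned i = 0; i < backoff; i++)
                    ;                              /* backoff delay */
                if (backoff < MAX_BACKOFF)
                    backoff <<= 1;                 /* exponential backoff */
                if (++spins > SPIN_BUDGET)
                    sched_yield();                 /* give up the processor */
            }
            /* Test-and-set: attempt the atomic exchange only when the lock
             * looked free; retry if another processor acquired it first. */
            if (!atomic_exchange_explicit(&l->held, 1, memory_order_acquire))
                return;
        }
    }

    void spin_unlock(spinlock *l)
    {
        atomic_store_explicit(&l->held, 0, memory_order_release);
    }

A queue-based lock would go further, giving each waiting processor its own flag to spin on, as in the schemes of Anderson [1990] and Mellor-Crummey and Scott [1991].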
Semaphore Based Designs: Most of the critical issues related to synchronization problems in multiprocessor operating systems were addressed in the context of various efforts to implement multithreaded versions of UNIX for symmetric multiprocessing (SMP). In such systems multiple threads may be executing kernel code at the same time while running on different processors. Some multiprocessor implementations of the UNIX operating system have used semaphores for synchronizing access to shared kernel data structures [Bach 1984]. Semaphore operations are implemented as spin-lock controlled critical sections. Most of the problems in using semaphore-based synchronization are related to race situations in testing a condition and then blocking a thread using the P operation [Ruane 1991]. The traditional approach in uniprocessor UNIX kernels for process synchronization has been based on the sleep/wakeup mechanism. A set of processes can wait for a condition by executing the sleep operation on an address value. The wakeup operation on that address value resumes all of the sleeping processes. At many places in the UNIX kernel, resumption of all of the blocked processes is desired. Emulating this using semaphores is somewhat cumbersome. For this reason, in implementing a multithreaded UNIX kernel using semaphores, some additional semaphore operations have been proposed that resume all blocked processes [Schimmel 1994].

Sleep/Wakeup-based Designs: In uniprocessor UNIX systems, when a wakeup operation is executed, all the blocked processes are resumed and given the CPU quantum in some sequential order. Each resumed process has to re-check the conditions that caused it to sleep earlier, because these conditions might have been altered by some other resumed (or a new) process. A process may have to sleep again after re-checking the conditions. If the critical section is short, then this may not cause a problem, because each process will finish executing its critical section in the given time quantum. However, the use of such a mechanism in a multiprocessor system introduces some potential for inefficiency. A wakeup operation may cause a number of processes to be resumed in parallel and contend for the critical section, thus causing all but one of these processes to be suspended again [Bach 1984]. This is termed the thundering herd problem [Ruane 1991] [Campbell et al. 1991]. To avoid this kind of problem, some of the multiprocessor implementations of UNIX have introduced the concept of WakeupOneProcess [Campbell et al. 1991], which causes only the highest priority sleeping process to be resumed. Another concept adopted by some parallel UNIX designs is the introduction of mutex locks that are implicitly released on a sleep call and reacquired on wakeup [Campbell et al. 1991] [Ruane 1991]. This scheme is used for deadlock avoidance.
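The same pattern survives almost unchanged in condition-variable interfaces: the mutex is implicitly released while waiting and reacquired on wakeup, the condition is always re-checked in a loop, and waking a single waiter avoids the thundering herd that a broadcast can cause. A minimal POSIX sketch, with a hypothetical counter standing in for the shared kernel resource:

    /* Minimal sketch of the sleep/wakeup pattern with POSIX condition
     * variables; the counter is a hypothetical stand-in for a shared resource. */
    #include <pthread.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t nonzero = PTHREAD_COND_INITIALIZER;
    static int available = 0;

    void consume_one(void)
    {
        pthread_mutex_lock(&m);
        /* Re-check the condition in a loop: another resumed (or new) thread
         * may have consumed the resource before this one ran again. */
        while (available == 0)
            pthread_cond_wait(&nonzero, &m);   /* releases m while asleep */
        available--;
        pthread_mutex_unlock(&m);
    }

    void produce_one(void)
    {
        pthread_mutex_lock(&m);
        available++;
        /* Waking one waiter plays the role of WakeupOneProcess; a
         * pthread_cond_broadcast here would resume every waiter and risk the
         * thundering herd described above. */
        pthread_cond_signal(&nonzero);
        pthread_mutex_unlock(&m);
    }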
Synchronization in Message Passing Systems: In message-based systems, the concepts of causality and event ordering play an important role in process synchronization [Lamport 1978]. The process group mechanism provides a convenient abstraction that allows a collection of processes to be viewed as one logical entity for control and communication [Liang et al. 1990]. The group concept is supported by the V system at the kernel level, but it does not provide any global synchronization mechanisms across process groups. Isis, which is a message-passing programming environment, provides a set of high-level abstractions for communication and synchronization in a group of processes [Birman 1985]. Two useful abstractions supported by it are atomic broadcast and causal broadcast. If two messages are sent to a group using the atomic broadcast primitive, then all members of the group receive the two messages in exactly the same order. In the case of causal broadcast, all such messages are delivered to the group members in an order that is consistent with the causality relationship among the messages. However, the utility of causal and total ordering of communication has been a matter of recent debate [Cheriton and Skeen 1993]. The contention is that the end-to-end argument [Saltzer et al. 1984] is violated in attempting to solve application-level problems (such as ordering requirements) at the communication level. Database-style transaction mechanisms can provide a viable alternative. However, in the presence of a causally and totally ordered communication subsystem, the programming model seen by the user is significantly simplified [Birman 1994].

3.3 Scheduling and Load Balancing

In this section we first describe some of the process/thread scheduling techniques that have been used in shared memory multiprocessor systems. Next we examine the impact of synchronization requirements on scheduling. In multicomputer and distributed systems, load balancing mechanisms are crucial for high performance. Process migration is an important mechanism for dynamically balancing load in a system. The issues related to the designs of these mechanisms are examined here.

Scheduling Issues: Some of the important problems that arise in scheduling parallel jobs on shared memory multiprocessors are outlined in [Tucker and Gupta 1989]. These problems are equally relevant in multicomputer and distributed system scheduling. In general the problems arise when the number of processes (each process here represents one kernel-level thread) exceeds the number of available processors. Descheduling of processes in spin-lock controlled critical sections can cause other processes to spin until the lock-holder process is scheduled again. Producer-consumer synchronization relationships among processes can also result in the consumer wasting cycles until the producer gets scheduled. Similarly, in barrier synchronization, delaying one process can cause the entire set of processes in its group to be delayed at the barrier. When the number of processes exceeds the number of processors, the scheduler may resort to time-slicing and thus cause overheads due to context switches. Marsh et al. [1991] use the term context switch to refer to the case when a processor is switched from one thread to another in the same address space. This is less expensive than processor reallocation, in which the processor is switched to a thread in a different address space; reallocation costs more because of the loss of data in the caches and the TLB, and because more information needs to be saved. Tucker and Gupta argue that for good performance the total number of runnable processes for a job should be kept equal to the number of processors available to that job.

The various approaches for multiprogram scheduling on multiprocessor and distributed systems fall into two broad categories: space partitioning and time-sharing. Space partitioning schemes allocate a dedicated set of processors to a job, whereas time-sharing schemes time-slice processors among the processes from different jobs. In the context of time-sharing schemes, the concept of co-scheduling was originally proposed by Ousterhout [1982]. This requires that the processes belonging to a job be given a time-slice on different processors at about the same time, in order to reduce the effect of the synchronization problems mentioned above. It also requires that the processors be closely synchronized in making scheduling decisions. To eliminate the problem of descheduling within a spin-lock controlled critical section, various proposals have been made. One approach is to use a scheduler which does not preempt a process that is in a critical section [Zahorjan et al. 1988]. In Symunix, group-based scheduling primitives are proposed to set different policies for scheduling and preemption [Edler et al. 1988]. In one policy the traditional scheduling and preemption policies of UNIX are applied to all processes in a group. The second kind of group policy requires that all members be scheduled and preempted at approximately the same time. A third policy makes the members of a group immune from preemption. This proposal also includes mechanisms for requesting a temporary non-preemption policy from the scheduler for short-term conditions.

Most scheduling algorithms allow assignment of different priorities to threads. The priority inversion problem arises when a higher priority thread is blocked on a resource that is being held by a lower priority thread. Some systems support preemptive scheduling of threads [Eykholt et al. 1992]; a higher priority thread can preempt a lower priority thread and acquire a processor.
Support is also provided for disabling preemption of a thread for some bounded sections of its code. In Mach, a higher priority thread can voluntarily relinquish the processor instead of spinning [Black 1990]. Also, if the blocked thread knows the identity of the current lock holder, then it can hand off the processor to that thread. SunOS 5.0 uses priority inheritance: a thread's priority is determined by the priorities of the threads that it is blocking [Eykholt et al. 1992].
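Priority inheritance of this kind is also visible at the POSIX interface level. The sketch below is only a rough analogue of the idea, not the SunOS 5.0 kernel mechanism, and PTHREAD_PRIO_INHERIT is an optional feature whose availability depends on the platform.

    /* Rough analogue of priority inheritance using a POSIX mutex protocol:
     * a low-priority thread holding the mutex temporarily runs at the
     * priority of the highest-priority thread blocked on it. */
    #include <pthread.h>

    pthread_mutex_t resource_lock;

    void init_resource_lock(void)
    {
        pthread_mutexattr_t attr;

        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
        pthread_mutex_init(&resource_lock, &attr);
        pthread_mutexattr_destroy(&attr);
    }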
In shared memory multiprocessor systems, load balancing schemes are relatively easy to implement compared to multicomputer and distributed systems. In shared memory multiprocessors it is possible to maintain just one run queue from which all processors dequeue runnable threads. However, this data structure can become a hot spot of contention. For this reason, in many systems a separate local queue is maintained for each processor in addition to a global queue [Black 1990]. Relocating a thread to another processor simply involves putting its control block in that processor's local queue.

3.3.1 Scheduling on Shared Memory Multiprocessors

Multiprogram Scheduling: A number of scheduling policies for multiprogrammed multiprocessors are evaluated in [Leutenegger and Vernon 1990]. One of the policies evaluated there is the dynamic partitioning policy proposed by Tucker and Gupta [1989], where each job is given a dedicated group of processors whose size may change dynamically. It also assumes that an application can match the number of its runnable processes to the number of available processors. Other policies evaluated there include the following. In FCFS, all the processes of a new job are placed in the global run queue. The Smallest Number of Processes First (SNPF) policy and its preemptive version give highest priority to the processes from jobs with the smallest number of unscheduled processes. Coscheduling picks a number of runnable processes from a global list in a round-robin fashion, to run one on each of the processors for the next quantum. Processes of a given job appear in contiguous positions in this list; blocked processes are skipped in the selection process. The RRprocess policy invokes a round-robin scheduling policy on a global run queue. The RRjob policy uses round-robin scheduling of jobs maintained in a global queue. The size of the quantum is adjusted in proportion to the number of processes in the job. The general conclusion of the evaluation by Leutenegger and Vernon indicates that these policies have the following performance ranking (going from best to worst): dynamic partitioning, RRjob, RRprocess, coscheduling, SNPF, and FCFS. Studies presented by Zahorjan and McCann [1990] also indicate that dynamic scheduling policies that allow a job to dynamically acquire and release a processor are superior to static scheduling. In the case of static scheduling policies, run-to-completion performs better than a round-robin scheme.

Master-Slave Scheduling: In a master-slave system, when a thread running on one of the application processors requests some system processing, it is suspended and put in the master's local queue. A preemptive reschedule is performed whenever a higher priority thread becomes ready to run on the master. Wendorf et al. [1989] have evaluated some scheduling policies for master-slave systems. This study was conducted on a configuration with a small number of processors. Two policies presented there are called OS-priority and OS-preempt. In the OS-priority policy the master always gives higher priority to processes in its local queue. A process will continue to run on the master after it has completed its system function execution, even when other processes are waiting in the master's local queue. There is no preemptive rescheduling. In the OS-preempt scheme, the master avoids doing any application processing whenever there is a process waiting in its local queue. Wendorf et al. show that the OS-preempt policy provides the best performance and can perform as well as a fully symmetric system.

Interrupt Handling and Dispatching Issues: Handling of interrupts in multiprocessor systems needs special attention. Generally, in uniprocessor systems interrupts are handled on the kernel stack of the interrupted process. The process resumes only when the interrupt handler has completed its execution. In the SunOS 5.0 kernel, interrupts are treated as asynchronously created threads. Complete creation of a new thread for each interrupt would be too expensive, so a pool of preallocated and partially initialized threads is maintained for quick handling of interrupts. The interrupted thread is pinned under the interrupt thread, and cannot be scheduled on another processor, until the interrupt thread returns or blocks. Interrupts may get nested in this architecture. If an interrupt handling thread gets blocked on a synchronization variable, it is turned into a full-fledged thread capable of being scheduled on any processor. At this point the interrupted thread can be resumed.

3.3.2 Scheduling on Multicomputer and Distributed Systems

Load Balancing: Load distribution and balancing schemes [Shivaratri et al. 1992] are needed in multicomputer and distributed operating systems to dynamically improve system performance. In local area networks, a number of tools have been designed for distribution of batch jobs among idle workstations [Litzkow et al. 1988]. In multicomputer operating systems, similar kinds of schemes have been used for better utilization of resources [Smith 1988] [Barak and Litman 1985] [Zacew et al. 1993]. There are four fundamental components of any dynamic load balancing architecture [Shivaratri et al. 1992]: transfer policy, selection policy, location policy, and information policy. The transfer policy determines when a processor should participate in a load balancing activity. Typically it is expressed in terms of some measures (such as exponentially averaged CPU utilization or the number of jobs in the run queue at that node) that indicate the load status. The selection policy selects a candidate task for transfer to another node to balance load. In general, migrating a partially executed task (process) poses a number of problems, as discussed below. Some of these are primarily related to location-dependent system calls. The location policy selects a suitable partner for performing the process migration. The information policy determines how to communicate the load status of nodes across the network. Condor uses a centralized scheme, whereas MOS uses a decentralized approach. The MOS scheme has been adopted in an OSF/1 based operating system for massively parallel architectures [Zacew et al. 1993]. Finally, a load balancing scheme should be stable and should not lower the overall system performance.

Process Migration Issues: The capability to fork a new process on another node only partially helps in distributing load. The basic reasons for supporting process migration are performance improvement and fault-tolerance.
Surveys of process migration issues are presented by Smith [1988] and Douglis and Ousterhout [1991]. The presence of a single file system name-space across the network, as in Locus [Walker et al. 1983], Sprite [Douglis and Ousterhout 1991], and OSF/1 AD TNC [Zacew et al. 1993], avoids some cumbersome problems in process migration. Heterogeneity of computers can add another dimension to this complexity. A process migration design has to choose among alternatives that involve trade-offs among four factors: transparency, residual dependence, performance, and complexity. Residual dependencies are related to the requirement of supporting continued service on the source machine for the needs of the migrated process. The Condor system maintains a shadow process on the source machine to service all the location-dependent system calls. Condor's approach becomes inefficient if a migrated process makes such calls very frequently. In contrast, Sprite relocates file system related state to the new host to minimize residual dependencies. Transparency implies that processes and users do not notice that the host node of a migrated process has changed; for example, its process-id remains the same. The state of a process that needs to be transferred to the destination node includes its virtual memory, the state of open files, message channel states, execution state, and kernel state (pid, user-id, signal masks, etc.) [Douglis and Ousterhout 1991].

Several different approaches have been adopted by various systems for virtual memory transfer. In Locus the process is first suspended, the entire state (including its virtual memory) is transferred to the destination, and then the process is resumed. This can cause a process to be suspended for a significant amount of time, resulting in time-outs by other processes interacting with it. In the V system, the process continues to execute at the source node while the virtual memory is being transferred. In Accent, transfer of virtual memory pages occurs on demand. Thus residual dependencies can last for a long time, which can be undesirable if source node failure is a matter of concern. In Sprite, dirty pages are flushed to the file server (the file service provides a coherent view of the files) during migration, and the destination node retrieves these pages from the file server as they are needed.

The state of open files consists of the file reference, cached blocks, and access position. The file reference indicates the location and organization of the file's data on secondary storage. In the Sprite design, file reference and access position information are transferred to the destination node. If migration leads to concurrent sharing of the file at more than one node, then caching is disabled and the access pointer is maintained only at the server. In the Locus design, when a file is concurrently shared at different nodes, access to the file pointer is synchronized using a token-based approach. In order to perform an I/O operation, the node must acquire the token, which guarantees mutual exclusion in updating the access pointer.

3.4 Object-Based and Object-Oriented Approaches

A survey of major issues and approaches in designing object-based distributed programming systems is presented by Chin and Chanson [1991]. Objects are classified as persistent vs. non-persistent, and passive vs. active. A non-persistent object depends on some process (thread) for its existence, but a persistent object does not. Passive objects are just containers of data; no process is permanently bound to such an object.
A process may execute within several objects during its execution. In the case of an active object, one or more processes may be associated with it. Objects in Clouds are passive, whereas objects in Chorus, Amoeba, Argus, and Eden are active entities. In Chorus and Amoeba an object is bound to a single server process. In Argus an object is implemented by a dynamic set of language-supported lightweight processes. In Eden a UNIX process is used to implement an object.

The granularity of objects supported by the system is another design issue, which has an impact on object management overhead. Large-grain objects execute a relatively large number of instructions to perform an interface operation. These objects may encapsulate a large number of primitive data types such as integers and reals. Typically a large-grain object corresponds to a virtual address space. Objects in Clouds and Eden are large grain; a Clouds object corresponds to a virtual address space composed of segments. Objects in Mentat, which is a C++ based distributed/parallel programming system, are also large grained, because each object is implemented as an operating system process [Grimshaw 1993]. Medium-grain objects are relatively small, and can be created and maintained inside a large-grain object. Fine-grain objects correspond to primitive data types such as integers, reals, etc.

In the case of passive objects, there are two distinct techniques for implementing invocation of an object's operation by a thread. One is to map the object into the invoking thread's virtual address space. The thread then executes the operation as a conventional procedure call. The second technique is to locate the object and create a new thread at the remote node on behalf of the invoker thread. The results of the invocation are finally passed back to the invoker. Clouds supports both these techniques. For active objects, invocations are based on remote procedure calls. Several systems, such as Cronus and Nexus, have provided asynchronous RPC facilities to exploit parallelism. A survey of various asynchronous RPC facilities is presented by Ananda et al. [1992]. A related development is the introduction of the concept of futures in Cronus and Mentat. Futures allow dataflow relationships between processes to be established at run-time. They also allow more parallelism to be introduced in the computations of a client and a server. After making an invocation, the client can proceed without waiting for the results until they are needed. Furthermore, the results of an invocation can be passed as parameters to a second invocation even before the first one has completed. Using futures-based communication, the server of the first invocation can be instructed to forward the results directly to the server of the second invocation.

Object-oriented approaches have been used by several operating system projects. The Choices design uses C++ to define a family of operating systems using a class hierarchy. A member of this family can be specialized to a particular hardware or application. A framework in Choices is a collection of abstract classes that represent generalized components. A subframework refines a framework with subclasses and constraints; it represents a specialized component. A subclass behaves as a subtype of its superclass based on inclusion polymorphism. For example, PageTable is defined as an abstract class that is used to abstract a hardware page table. The interface methods for such an object are addTranslation, removeTranslation, etc.
A concrete subclass of this, for example VAXPageTable or 68KPageTable, would implement the page table for a specific hardware platform. Another example is the implementation of processes. State information of a process is captured in two classes: Process and ProcessContext. The Process class captures hardware-independent attributes of a process. This class can be further specialized as ApplicationProcess or SystemProcess. The ProcessContext class captures hardware-dependent context information about the process. In Choices, invocations on objects are based on C++ method lookup procedures. System objects cannot be accessed directly by user-level objects. User-level objects have access to object proxies, which contain indirect references to system objects. This allows privilege levels to be crossed in a transparent fashion. In contrast, in Clouds user-level objects can access system-level objects only via the system call interface. Clouds is implemented using C++ based class hierarchies, but its user interface supports an object-based computing model. Object-orientation in applications is supported at the programming language level by Distributed C++ and Distributed Eiffel.

The Apertos architecture centers on the concept of an object and its associated meta-objects [Yokote 1992]. Meta-objects implement the object and have access to its internal state. The meta-objects of an object may change dynamically, thus supporting mobility. By associating an object with a new set of meta-objects one could change its attributes; for example, an object could be changed from volatile to persistent. The Objective PEACE approach is to define a minimal basis of system functions for process management and communication that can be used to derive tailored microkernels for massively parallel systems [Cordsen and Schroeder-Preikschat 1991].

3.5 Virtual Memory and Distributed Shared Memory

In this section, we discuss three important design trends: memory-mapped files/objects, external memory managers, and distributed shared memory. The concept of memory-mapped objects or files is not new. Multics supported dynamic mapping of storage segments into the virtual address space of a process [Daley and Dennis 1968]. This has been used in the IBM System/38, and later in IBM's AS/400 and Apollo's architectures. Most of the contemporary systems such as Mach, Chorus, and Clouds have supported this concept. To access a memory object, a task maps that object onto some part of its virtual address space. During instruction execution, access to a particular location in that part of the address space may result in a page fault if the corresponding page is not present (cached) in the task's primary storage. This causes a page-fault handling request to be sent to the task managing the object. The manager provides the requested page and may enforce desired synchronization if the page is being concurrently shared by other tasks in the system [Rashid et al. 1988]. For example, the manager may make sure that if some task is writing a page then no other task can read it, and similarly that no two tasks concurrently write the same page at any time. A slightly modified form of this scheme has been used in the OSF/1 AD TNC system for massively parallel multicomputers [Zacew et al. 1993]. This system has introduced another layer in each task to execute coherency protocols with other tasks, thereby relieving the memory manager task of this burden. For this it has added the external memory manager (XMM) of Mach to the kernel to support the emulation of distributed shared memory.
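Seen from user level, the memory-mapped file idea appears in UNIX-like systems as the mmap interface. The sketch below is a minimal illustration (error checking omitted, and the function name and path are merely examples): the file's pages are faulted in on demand as they are touched.

    /* Minimal sketch of mapping a file into the virtual address space and
     * reading it through ordinary loads; first access to a page may fault it
     * in on demand. Error checking is omitted for brevity. */
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    long sum_bytes(const char *path)     /* path is just an example argument */
    {
        int fd = open(path, O_RDONLY);
        struct stat st;
        fstat(fd, &st);

        /* Map the whole file; the loop below touches it page by page. */
        unsigned char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

        long sum = 0;
        for (off_t i = 0; i < st.st_size; i++)
            sum += data[i];

        munmap(data, st.st_size);
        close(fd);
        return sum;
    }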
Distributed shared memory (DSM) systems have been investigated quite widely over the past decade in the context of multicomputer systems and distributed systems, which do not support physically shared memory between processes running on different nodes. A DSM system uses operating system or software library level mechanisms to emulate shared memory in the virtual address spaces of processes running on different nodes. There are some hardware-supported distributed shared memory systems such as DASH and Memnet. Most DSM systems use the virtual memory support mechanisms in their implementations. Others provide modified compilers that detect references to shared data and emulate the shared memory in software. In the following discussion we address software-level mechanisms and related issues for implementing distributed shared memory. Paradigm is a hardware-software hybrid scheme for massively parallel multicomputers [Cheriton et al. 1991]. Software-based DSM schemes have been implemented on workstation networks and hypercube-based multicomputers.

There are many motivations for supporting distributed shared memory. One is the simplicity of the interprocess communication abstraction in a coherent shared memory system. Writing parallel programs in such a model is easier than in message-based systems. The programmer is relieved of the burden of programming complex communication protocols. This makes it easy to program, debug, and maintain applications. Complex data structures shared by distributed processes are not required to be explicitly passed around between nodes in messages or RPC calls. DSM systems can migrate data to the node where it is being accessed and thus exploit locality of reference to reduce network communication. Another reason is to make the entire memory of a distributed memory machine available to programs whose memory needs exceed the memory available at a single node. This was one of the motivations for implementing a distributed virtual memory system on the iPSC/2 [Li and Schaefer 1989].

The design issues in DSM systems are related to the structure of virtual address spaces, granularity of consistency, coherence protocols, synchronization mechanisms, and replacement policies [Nitzberg and Lo 1991] [Tam et al. 1990]. Implementing DSM in heterogeneous systems poses additional problems that are commonly associated with the internal representation of data. A virtual address space can be flat or a collection of segments. Typical hardware-based DSM systems use a block size of 16 to 64 bytes as the unit of consistency. Systems such as Ivy [Li and Hudak 1989] that are based on virtual memory mechanisms use the page size as the unit of consistency. Paradigm, which is a hybrid system, uses pages of size 4K as the mapping unit, but supports consistency at the fine-grain level of 64-byte blocks. Systems such as Clouds support consistency at the level of shared segments mapped into the virtual address spaces of different objects. Other systems, such as Munin [Bennett et al. 1990] and Orca [Bal et al. 1992], have supported consistency at the level of structured objects. A classification of consistency protocols for DSM is presented by Stumm and Zhou [1990]. These are classified along two dimensions, namely migrating/non-migrating and replicated/non-replicated data.
The most commonly used protocols are based on read-replication, which supports single-writer/multiple-reader synchronization. A shared item can be migrated to or replicated in the address spaces of multiple processes as long as it is only being read. This is identical to the write-invalidate approach for cache coherence [Archibald and Baer 1986]. In distributed systems and multicomputers it is implemented using directories, which keep track of the current owner (the process that last modified the item) and the current list of copies. In large systems the current trends favor directory-based schemes using the write-invalidate approach. In the case of a full-replication scheme for memory coherence, a copy of the data is created in the address space of each process using the item. A write operation is broadcast to all copies. This is identical to the write-update protocol [Archibald and Baer 1986]. To support it in a large multicomputer or distributed system, one needs a multicasting facility. One also needs atomic broadcast primitives to synchronize concurrent updates from two different nodes. Orca, which is a distributed/parallel programming language implemented above Amoeba, uses this type of approach for managing shared objects [Bal et al. 1992]. An approach based on a central server (i.e., non-migrating and non-replicated) is not attractive, because each access to the shared object incurs network communication and its associated latency. It is also not able to exploit the locality of references, which a migration-based scheme does.

Systems such as Munin provide multiple coherence protocols [Carter et al. 1991]. The programmer can optimize shared memory accesses by providing information about the expected access patterns of shared variables as part of the variable declaration. Based on this information, the system can choose a suitable consistency protocol. Also, a weaker consistency model (release consistency) allows Munin to reduce the overheads associated with these consistency protocols. Porting a parallel program from a shared memory machine to a DSM system like Munin is relatively easy, since only minor variable annotations are needed.
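As a rough illustration of the directory-based, write-invalidate bookkeeping described above, the sketch below shows the per-page state such a protocol might keep and how a write fault could be handled. The names, sizes, and the messaging stub are assumptions for illustration, not the layout of any particular system.

    /* Hedged sketch of directory state for a write-invalidate DSM protocol:
     * each shared page records its current owner and the set of nodes holding
     * read copies (single-writer/multiple-reader). All names are illustrative. */
    #include <stdio.h>

    #define MAX_NODES 64

    enum page_state { PAGE_INVALID, PAGE_READ_SHARED, PAGE_WRITE_EXCLUSIVE };

    struct dir_entry {
        enum page_state state;
        int owner;                   /* node that last wrote the page */
        char copyset[MAX_NODES];     /* nodes currently holding a read copy */
    };

    static void send_invalidate(int node)   /* stand-in for a real message send */
    {
        printf("invalidate copy held by node %d\n", node);
    }

    /* On a write fault, invalidate all outstanding read copies before granting
     * the faulting node exclusive ownership of the page. */
    void handle_write_fault(struct dir_entry *e, int faulting_node)
    {
        for (int n = 0; n < MAX_NODES; n++) {
            if (e->copyset[n] && n != faulting_node) {
                send_invalidate(n);
                e->copyset[n] = 0;
            }
        }
        e->owner = faulting_node;
        e->copyset[faulting_node] = 1;
        e->state = PAGE_WRITE_EXCLUSIVE;
    }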
3.6 Parallel and Distributed File Systems

Distributed File Systems: Numerous file systems for distributed computing environments have been developed over the past decade; these are relatively mature compared to file systems for parallel machines. For this reason, we focus more on the latter, and refer the reader to [Levy and Silberschatz 1990] for a detailed discussion of distributed file systems (DFS). Naturally, several file system issues are common to both distributed and parallel file systems; these include caching policies, cache coherence, naming, synchronization, etc. The differences between parallel and distributed file systems arise due to the greater influence of network bandwidth on DFSs. Communication is the costliest component of DFS operation, and so DFS designs are optimized with the intent of reducing network traffic. Often, DFSs have to trade off factors such as cache consistency to improve performance and scalability.

Usage Characteristics: Parallel file systems need to support different classes of applications, such as scientific computations, transaction processing, multimedia, etc. These classes display very different disk access patterns. Scientific applications are characterized by mainly sequential access to large files, so that the data transfer rate is more important than the I/O rate. Transaction processing, on the other hand, implies large numbers of random requests for smaller amounts of data. The typical access pattern is read-modify-update. Multimedia applications usually deal with very large sequential files, and data compression techniques are needed as part of the file system. Often, files are composed of logically independent streams of data, which may be accessed in parallel by multiple processes. Thus, these different classes of applications demand differing services from the file system and the underlying parallel I/O system.

Often, on parallel machines, the file system is used only as a staging area, for storing input data, generated results, and the temporary files needed during a parallel computation. Long-term archival storage is provided by a separate file system, usually located on the host (front-end) machine. Naturally, storage for archival purposes will have different access patterns than a staging area. Thus the design of the file system depends on its intended usage. Most current implementations of parallel file systems make certain assumptions about the anticipated nature of accesses to file data. In general, these systems assume that the files in a supercomputing environment are typically read sequentially in large chunks. Their design is therefore optimized for large sequential read operations. The file system attempts to allocate contiguous blocks of the disk to such files, thereby minimizing the seek operations needed and improving the transfer bandwidth. On the other hand, in approaches such as log-structured file systems, the assumption is that most read accesses can be satisfied by a main-memory cache, and therefore systems like Sprite are optimized with the assumption that write accesses dominate [Rosenblum and Ousterhout 1992].

File System Organization: The file system built on top of a multiple-disk I/O subsystem can either provide a single global view or allow the user to access the local disks independently. The Intel CFS, for example, makes the underlying multiplicity of disks transparent to the user [Pratt et al. 1989]. The user sees only a uniform logical namespace, and has no way to access the individual disks. In contrast, the nCUBE's file system is composed of one global layer overlaid on top of multiple local file systems [Pratt et al. 1989]. The user can directly interact with the local file systems, bypassing the global layer altogether. A similar organization is also used in the Bridge file system [Dibble et al. 1988]. In conventional file systems, the fragmentation of a file's data on the disk leads to expensive seek operations while accessing the file. However, in log-structured file systems, the only representation on disk is an append-only log [Rosenblum and Ousterhout 1992]. This representation eliminates the need for seeks during write accesses. This is a major gain, since the seek operation is primarily mechanical and is the costliest component of the disk access time.

File System Interfaces: File systems on existing multiprocessors are usually based on the conventional UNIX-like interface, which provides primitives such as open, close, read, write, seek, etc. [Kotz 1992]. Advantages of such systems include transparency of the underlying parallel disk architecture (and therefore portability), and familiarity of the interface for most programmers. However, the conventional interface does not allow the sophisticated programmer to access the underlying parallelism directly.
Often, knowing the logical data layout and access patterns for 23 his/her application, the programmer can optimize the physical data layout on the disk, and write more e cient I/O code. So, to support genuine parallel access to data, the interface must provide additional primitives such as a parallel open, asynchronous I/O primitives, etc. In conventional le systems, an open call allows the system to authenticate the user and set up internal data structures, which help avoid repeated accesses to the le's metadata. In a parallel system, the le is shared amongst several processes. Having each of them issue an open call is wasteful. A parallel open can also de ne some logical partitioning of the data, and specify which of those partitions is being accessed. A le system may also support the de nition of logical views of the data using mapping functions. The programmer could for example impose a column-major view on a matrix which was stored internally in row-major order, and the le system would serve up the data in column-major order. A multiopen call is proposed by Kotz [1992] which opens the le for a prede ned process group. All processes in the group are then provided with a le handle to access the shared le. The Bridge le system also provides a parallel open call of this nature [Dibble et al. 1988]. In Vesta, the process of gaining access to a le has been split into two phases { an attach, which corresponds to the conventional open, and an open which de nes the logical partition to be used [Corbett et al. 1993]. Some le systems (such as Vesta [Corbett et al. 1993] and the Intel CFS [Pierce 1989]) provide le creation primitives that allocate a speci ed amount of disk space for the le at the time of creation. When the application knows a priori the amount of disk space needed by the le (or a lower limit), this call helps ensure that the required resources are available. In conventional le systems, when a le is being appended to, new disk blocks need to be allocated at runtime. On multiple-disk systems, this could involve the synchronization of a set of processes handling the disks. Therefore, allocation of disk space is an expensive procedure in parallel le systems. If on the other hand, the allocation is done just once at creation-time, these extra allocation requests are not needed. Thus, the le system's performance on write operations is enhanced. Shared les can be opened in several di erent modes, depending on the intended use. For example, the Intel iPSC's Concurrent File System provides four modes [Asbury and Scott 1989]. In one mode, each process sharing the le gets its own copy of the le pointer, and can read or write whichever parts of the le it needs to. There is no synchronization between processes. In another mode, all the processes perform sequential I/O with respect to a global le pointer. Other modes allow processes to read/write the le in strictly interleaved fashion. The le system may also permit self-scheduled access, in which the processes randomly access a global le pointer, perform I/O, and appropriately modify the shared pointer in one atomic operation. Thus, in this case, the interleaving of I/O operations among processes is arbitrary. For those modes which use a shared le pointer, the le system needs to provide a mechanism for serializing accesses to the pointer. A read or a write operation on a shared le requires fetching the current pointer and incrementing it by the appropriate amount before the actual data transfer can take place. 
The file system may also provide parallel versions of the read and write calls for file I/O, with semantics that are somewhat different from those of the corresponding conventional calls. For example, in Bridge, when a read operation is issued for a shared file, multiple blocks are read and one block each is transferred to the members of the process group sharing that file. In Parasoft Express' Cubix model of operation, a single read operation can supply identical copies of the input data to multiple processes, and a single output operation can result in all the cooperating processes sending different data items to be written out [Parasoft 1990]. New file systems also depart from the conventional UNIX model by providing support for structured files, i.e., files of records rather than simple byte-streams. Thus, a read or write in such a system would involve record-based I/O.

In order to provide a recovery mechanism, file systems also need primitives for checkpointing. An example is the Vesta file system, which can roll back the system to an earlier state at which a checkpoint was recorded [Corbett et al. 1993]. The checkpoint is maintained in terms of metadata rather than data in the file.

4 Conclusion

A survey of the existing operating systems for distributed and parallel machines reveals several interesting trends. Most systems attempt to maintain some level of compatibility with the familiar UNIX model. Some even make this one of the basic goals of the design. In shared memory multiprocessors, multithreaded kernels supporting symmetric processing of OS functions are used. Scalability limitations of such systems lead to heterogeneous multicomputer architectures for building massively parallel systems. Such systems typically use a micro-kernel at each node and distribute the OS service functions on separate pools of processors. This leads to a convergence of the designs of parallel and distributed operating systems, since the latter are primarily based on message communication. However, distributed shared memory and memory-mapped objects and files are often implemented on top of this model to allow programmers to continue using the more familiar shared memory model.

The influence of distributed systems on the development of multiprocessor operating systems is clearly seen. For example, the micro-kernel approach popularized by Mach is now adopted for developing multiprocessor operating systems as well. The task-thread model is widely used to express parallelism, whereas the concept of protection domains is provided using the object model. Most modern operating systems provide lightweight kernel-supported processes to the application programmers. For expressing concurrency in application systems, the current trend is to support a two-level model implementing threads at the user level and lightweight processes at the kernel level.

For parallel machines, the I/O problem is widely recognized as a major hurdle to achieving high performance. Data declustering onto multiple-disk I/O systems is now seen as an essential strategy for combating the I/O bottleneck. File systems also extend their UNIX-like set of primitives to include primitives suitable for operations on parallel files. These are essential for the programmer to extract the maximum performance out of the parallel file organizations.

Acknowledgements: The authors are deeply thankful to the anonymous reviewers whose detailed and insightful comments on the earlier draft of this paper helped us tremendously in improving our presentation.

References
Black, E. D. Lazowska, and J. D. Noe. The EdenSystem: A Technical Review. IEEE Transactions on Software Engineering, SE-11:43{59,January 1985.[Ananda et al. 1992] A. L. Ananda, B. H. Tay, and E. K. Koh. A Survey of AsynchronousRemote Procedure Calls. Operating Systems Review, pages 92{109, April 1992.[Anderson et al. 1989] Thomas E. Anderson, Edward D. Lazowska, and Henry M. Levy.The Performance Implications of Thread Management Alternatives for Shared-MemoryMultiprocessors. IEEE Transactions on Computers, pages 1631{1644, December 1989.[Anderson et al. 1991] Thomas Anderson, Brian Bershad, Edward Lazowska, and HenryLevy. Scheduler Activations: E ective Kernel Support for the User-Level Management ofParallelism . In Proceedings of the 13'th ACM Symposium on Operating System Principles,pages 95{109, 1991.[Anderson 1990] Thomas Anderson. The Performance of Spin Lock Alternatives for SharedMemory Multiprocessors. IEEE Transactions on Parallel and Distributed Systems, Jan-uary 1990.[Archibald and Baer 1986] James Archibald and Jean-Loup Baer. Cache Coherence Pro-tocols: Evaluation Using a Multiprocessor Simulation Model . ACM Transactions onComputer Systems, pages 273{298, November 1986.[Asbury and Scott 1989] Raymond K. Asbury and David S. Scott. FORTRAN I/O on theiPSC/2: Is there read after write? In Proceedings of the Fourth Conference on Hypercubes,Concurrent Computers and Applications, pages 129{132, 1989.[Athas and Seitz 1988] W. Athas and C. Seitz. Multicomputers: Message-Passing Concur-rent Computers. IEEE Computer, August 1988.[Bach 1984] Maurice Bach. Multiprocessor UNIX Operating Systems. Bell LaboratoriesTechnical Journal, pages 1733{1749, October 1984.[Bal et al. 1992] H. E. Bal, M. A. Kaashoek, and A. S. Tanenbaum. Orca: A Language forParallel Programming of Distributed Systems. IEEE Transactions on Software Engineer-ing, Vol. 18, No. 3:190{205, March 1992. 26 [Barak and Litman 1985] A. Barak and A. Litman. MOS: A Multicomputer DistributedOperating System. Software Practice and Experience, pages 725{737, August 1985.[Bennett et al. 1990] J. Bennett, J. Carter, and W. Zwaenepoel. Munin: Distributed SharedMemory Based on Type Speci c Memory Coherence . In Proc. of 1990 Conf. on Principlesand Practice of Parallel Programming, pages 168{176, 1990.[Bershad 1990] Brian Bershad. The Increasing Irrelevance of IPC Performance forMicrokernel-Based Operating Systems. In Proceedings of the Summer 1990 USENIX,1990.[Birman 1985] Kenneth P. Birman. Replication and fault-tolerance in the ISIS system. TenthACM Symposium on Operating Systems Principles, pages 79{86, December 1985.[Birman 1994] Kenneth Birman. A Response to Cheriton and Skeen's Criticism of Causaland Totally Ordered Communication . ACM Operating Systems Review, pages 11{21,January 1994.[Black 1990] David Black. Scheduling Support for Concurrency and Parallelism in the MachOperating System. IEEE Computer, pages 35{43, May 1990.[Butler and Lusk 1992] Ralph Butler and Ewing Lusk. User Guide to the P4 ProgrammingSystem. Technical report, Argonne National Laboratory, October 1992.[Campbell et al. 1991] Mark Campbell, Richard Barton, Jim Browning, Dennis Cervenka,Ben Curry, Todd Davis, Tracy Edmonds, Russ Holt, John Slice, Tucker Smith, and RichWescott. The Parallelization of UNIX System V Release 4.0. In Proceedings of the Winter1991 USENIX, pages 307{323, January 1991.[Campbell et al. 1993] Roy Campbell, Nayeem Islam, David Raila, and Peter Madany. De-signing and Implementing Choices: An Object-Oriented System in C++. 
[Carter et al. 1991] John Carter, John Bennett, and Willy Zwaenepoel. Implementation and Performance of Munin. In Proceedings of the 13th ACM Symposium on Operating System Principles, pages 152–164, 1991.
[Cheriton and Skeen 1993] David Cheriton and Dale Skeen. Understanding the Limitations of Causally and Totally Ordered Communication. In Proceedings of the 14th ACM Symposium on Operating System Principles, pages 44–57, 1993.
[Cheriton et al. 1991] David Cheriton, Hendrik A. Goosen, and Patrick D. Boyle. Paradigm: A Highly Scalable Shared-Memory Multicomputer Architecture. IEEE Computer, pages 33–46, February 1991.
[Cheriton 1988] David Cheriton. The V Distributed System. Communications of the ACM, March 1988.
[Chin and Chanson 1991] Roger S. Chin and Samuel T. Chanson. Distributed Object-Based Programming Systems. ACM Computing Surveys, pages 91–124, March 1991.
[Corbett et al. 1993] Peter F. Corbett, Sandra Johnson Baylor, and Dror G. Feitelson. Overview of the Vesta parallel file system. In IPPS '93 Workshop on Input/Output in Parallel Computer Systems, pages 1–16, 1993.
[Cordsen and Schroeder-Preikschat 1991] J. Cordsen and W. Schroeder-Preikschat. Object-Oriented Operating Systems Design and the Revival of Program Families. In 1991 International Workshop on Object-Orientation in Operating Systems, pages 24–28, 1991.
[Dahl and Nygaard 1966] Ole-Johan Dahl and Kristen Nygaard. SIMULA – an ALGOL-Based Simulation Language. Communications of the ACM, pages 671–678, September 1966.
[Daley and Dennis 1968] Robert C. Daley and Jack B. Dennis. Virtual Memory, Processes, and Sharing in Multics. Communications of the ACM, pages 306–369, May 1968.
[Dasgupta et al. 1991] P. Dasgupta, R. LeBlanc, M. Ahamad, and U. Ramachandran. The Clouds Distributed Operating System. IEEE Computer, pages 34–44, November 1991.
[Detlefs et al. 1988] David L. Detlefs, Maurice P. Herlihy, and Jeannette M. Wing. Inheritance of Synchronization and Recovery Properties in Avalon/C++. IEEE Computer, pages 57–69, December 1988.
[Dibble et al. 1988] Peter Dibble, Michael Scott, and Carla Schlatter Ellis. Bridge: A high-performance file system for parallel processors. In Proceedings of the 8th International Conference on Distributed Computing Systems, 1988.
[Dongarra et al. 1993] Jack Dongarra, Rolf Hempel, Anthony Hey, and David Walker. A Proposal for a User-Level Message Passing Interface in a Distributed Memory Environment. Technical report, Oak Ridge National Laboratory, June 1993.
[Douglis and Ousterhout 1991] Fred Douglis and John Ousterhout. Transparent Process Migration: Design Alternatives and the Sprite Implementation. Software Practice and Experience, pages 757–785, August 1991.
[Edler et al. 1988] Jan Edler, Jim Lipkis, and Edith Schonberg. Process Management for Highly Parallel UNIX Systems. In USENIX Workshop on Unix and Supercomputers, pages 1–17, September 1988.
[Eykholt et al. 1992] J. R. Eykholt, S. R. Keimann, S. Barton, R. Faulkner, A. Shivalingiah, M. Smith, D. Stein, J. Voll, M. Weeks, and D. Williams. Beyond Multiprocessing: Multithreading the SunOS Kernel. In Proceedings of the Summer 1992 USENIX, pages 11–18, June 1992.
[Graunke and Thakkar 1990] Gary Graunke and Shreekant Thakkar. Synchronization Algorithms for Shared-Memory Multiprocessors. IEEE Computer, pages 60–69, June 1990.
[Grimshaw 1993] Andrew Grimshaw. Easy-to-Use Object-Oriented Parallel Processing with Mentat. IEEE Computer, pages 39–51, May 1993.
[Hansen 1970] Per Brinch Hansen. The Nucleus of a Multiprogramming System. Communications of the ACM, pages 238–241, 250, April 1970.
[Hauser et al. 1993] Carl Hauser, Christian Jacobi, Marvin Theimer, Brent Welch, and Mark Weiser. Using Threads in Interactive Systems: A Case Study. In Proceedings of the 14th ACM Symposium on Operating System Principles, pages 94–105, December 1993.
[Hillis and Tucker 1993] W. Daniel Hillis and Lewis W. Tucker. The CM-5 Connection Machine: A Scalable Supercomputer. Communications of the ACM, 36(11):31–40, November 1993.
[Karlin et al. 1991] Anna Karlin, Kai Li, Mark Manasse, and Susan Owicki. Empirical Studies of Competitive Spinning for a Shared Memory Multiprocessor. In Proceedings of the 13th ACM Symposium on Operating System Principles, pages 41–55, 1991.
[Khalidi and Nelson 1993] Yousef A. Khalidi and Michael N. Nelson. Implementation of UNIX on an Object-Oriented Operating System. In Proceedings of the 1993 Winter USENIX, pages 469–479, January 1993.
[Koller 1989] Jeff Koller. The MOOS II Operating System and Dynamic Load Balancing. In Proceedings of the Fourth Conference on Hypercubes, Concurrent Computers and Applications, pages 599–602, 1989.
[Korson and McGregor 1990] Tim Korson and John McGregor. Understanding object-oriented: A unifying paradigm. Communications of the ACM, 33(9):40–60, September 1990.
[Kotz 1992] David Kotz. Multiprocessor file system interfaces. Technical Report PCS-TR92-179, Dartmouth College, May 1992.
[Lamport 1978] Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–564, July 1978.
[Lea et al. 1993] Roger Lea, Christian Jacquemot, and Eric Pillevesse. COOL: System Support for Distributed Programming. Communications of the ACM, 36(9):37–46, September 1993.
[Leutenegger and Vernon 1990] Scott Leutenegger and Mary Vernon. The Performance of Multiprogrammed Multiprocessor Scheduling Policies. In Proceedings of the 1990 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 226–236, 1990.
[Levin et al. 1975] R. Levin, E. Cohen, W. Corwin, F. Pollack, and W. Wulf. Policy/Mechanism Separation in Hydra. In Proceedings of the Fifth Symposium on Operating Systems Principles, pages 132–140, November 1975.
[Levy and Silberschatz 1990] Eliezer Levy and Abraham Silberschatz. Distributed File Systems: Concepts and Examples. ACM Computing Surveys, pages 321–374, December 1990.
[Li and Hudak 1989] Kai Li and Paul Hudak. Memory Coherence in Shared Virtual Memory Systems. ACM Transactions on Computer Systems, pages 79–86, November 1989.
[Li and Schaefer 1989] Kai Li and Richard Schaefer. A Hypercube Shared Virtual Memory System. In Proceedings of the 1989 International Conference on Parallel Processing, pages I-125–132, 1989.
[Liang et al. 1990] L. Liang, S. Chanson, and G. Neufeld. Process groups and group communication. IEEE Computer, pages 56–66, February 1990.
[Lillevik 1991] Sigurd Lillevik. The Touchstone 30 Gigaflop DELTA Prototype. In 6th Distributed Memory Computing Conference, pages 671–677, 1991.
[Liskov and Zilles 1974] B. H. Liskov and S. Zilles. Programming with Abstract Data Types. SIGPLAN Notices, 9(4), 1974.
[Liskov 1988] Barbara Liskov. Distributed Programming in Argus. Communications of the ACM, 31(3), March 1988.
[Litzkow et al. 1988] Martin Litzkow, Miron Livny, and Matt Mutka. Condor - A Hunter of Idle Workstations. In Proceedings of the IEEE International Conference on Distributed Computing Systems, 1988.
[Loepere 1992] Keith Loepere. Mach 3 Kernel Principles, Mach 3.0. Open Software Foundation, 1992.
[Marsh et al. 1991] Brian Marsh, Michael Scott, Thomas LeBlanc, and Evangelos P. Markatos. First-Class User-Level Threads. In Proceedings of the 13th ACM Symposium on Operating System Principles, pages 100–121, 1991.
[Mellor-Crummey and Scott 1991] John M. Mellor-Crummey and Michael L. Scott. Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors. ACM Transactions on Computer Systems, pages 19–65, February 1991.
[Mullender et al. 1990] S. J. Mullender, G. van Rossum, A. Tanenbaum, R. van Renesse, and H. van Staveren. Amoeba, a Distributed Operating System for the 1990s. IEEE Computer, pages 44–53, May 1990.
[Nitzberg and Lo 1991] Bill Nitzberg and Virginia Lo. Distributed Shared Memory: A Survey of Issues and Algorithms. IEEE Computer, pages 52–60, August 1991.
[Ousterhout et al. 1988] J. K. Ousterhout, A. R. Cherenson, F. Douglis, M. N. Nelson, and B. B. Welch. The Sprite Network Operating System. IEEE Computer, pages 23–36, 1988.
[Ousterhout 1982] J. K. Ousterhout. Scheduling Techniques for Concurrent Systems. In Proceedings of the 3rd International Conference on Distributed Computing Systems, pages 22–30, October 1982.
[Parasoft 1990] Parasoft. Express 3.2 Introductory Guide. Parasoft Corporation, 2500 E. Foothill Blvd., Pasadena, CA 91107, 1990.
[Patterson et al. 1988] David Patterson, Garth Gibson, and Randy Katz. A case for redundant arrays of inexpensive disks (RAID). In ACM SIGMOD Conference, pages 109–116, June 1988.
[Pierce 1989] Paul Pierce. A concurrent file system for a highly parallel mass storage subsystem. In Proceedings of the Fourth Conference on Hypercubes, Concurrent Computers and Applications, 1989.
[Powell et al. 1991] M. L. Powell, S. R. Keimann, S. Barton, D. Shah, D. Stein, and M. Weeks. SunOS Multi-thread Architecture. In Proceedings of the Winter 1991 USENIX, pages 1–10, January 1991.
[Pratt et al. 1989] Terrence W. Pratt, James C. French, Phillip M. Dickens, and Stanley A. Janet, Jr. A comparison of the architecture and performance of two parallel file systems. In Proceedings of the Fourth Conference on Hypercubes, Concurrent Computers and Applications, pages 161–166, 1989.
[Rashid and Robertson 1981] Richard Rashid and George Robertson. Accent: A communication oriented network operating system kernel. In Proceedings of the 8th Symposium on Operating System Principles, pages 64–75, December 1981.
[Rashid et al. 1988] R. Rashid, A. Tevanian, M. Young, D. Golub, R. Baron, D. Black, W. Boloskyi, and J. Chew. Machine-Independent Virtual Memory Management for Paged Uniprocessor and Multiprocessor Architectures. IEEE Transactions on Computers, pages 896–908, August 1988.
[Rosenblum and Ousterhout 1992] Mendel Rosenblum and John Ousterhout. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems, 10(1):26–52, February 1992.
[Rozier et al. 1988] Marc Rozier, Vadim Abrossimov, Francois Armand, Ivan Boule, Michel Gien, Marc Guillemont, Frederic Herrmann, Claude Kaiser, Sylvain Langlois, Pierre Leonard, and Will Neuhauser. CHORUS Distributed Operating System. Computing Systems Journal, pages 305–370, December 1988.
[Ruane 1991] Lawrence Ruane. Process Synchronization in the UTS Kernel. Computing Systems, 3(3):387–421, 1991.
[Saltzer et al. 1984] J. H. Saltzer, D. P. Reed, and D. D. Clark. End-to-End Arguments in System Design. ACM Transactions on Computer Systems, 2(4):277–288, November 1984.
[Schantz et al. 1986] Richard Schantz, Robert Thomas, and Girome Bono. The Architecture of the Cronus Distributed Operating System. In Proceedings of the 6th International Conference on Distributed Computing Systems, pages 250–259, 1986.
[Schimmel 1994] Curt Schimmel. UNIX Systems for Modern Architectures. Addison-Wesley, 1994.
[Seitz 1985] C. L. Seitz. The Cosmic Cube. Communications of the ACM, January 1985.
[Shivaratri et al. 1992] N. Shivaratri, P. Krueger, and M. Singhal. Load distributing for locally distributed systems. IEEE Computer, pages 33–44, December 1992.
[Smith 1988] Jonathan Smith. A Survey of Process Migration Mechanisms. In ACM SIGOPS Newsletter, pages 28–40, July 1988.
[Stein and Shah 1992] D. Stein and D. Shah. Implementing Lightweight Threads. In Proceedings of the Summer 1992 USENIX, pages 1–9, June 1992.
[Stumm and Zhou 1990] Michael Stumm and Songnian Zhou. Algorithms implementing distributed shared memory. IEEE Computer, pages 55–64, May 1990.
[Sunderam 1990] V. S. Sunderam. PVM: A Framework for Parallel Distributed Computing. Concurrency: Practice & Experience, 2(4), December 1990.
[Tam et al. 1990] Ming-Chit Tam, Jonathan M. Smith, and David Farber. A taxonomy-based comparison of several distributed shared memory systems. ACM Operating Systems Review, pages 40–67, July 1990.
[Tripathi 1989] Anand Tripathi. An Overview of the Nexus Distributed Operating System Design. IEEE Transactions on Software Engineering, 15(6), June 1989.
[Tucker and Gupta 1989] Andrew Tucker and Anoop Gupta. Process Control and Scheduling Issues for Multiprogrammed Shared Memory Multiprocessors. In Proceedings of the 12th ACM Symposium on Operating System Principles, pages 159–166, 1989.
[Walker et al. 1983] B. Walker, G. Popek, R. English, C. Kline, and G. Thiel. The LOCUS distributed operating system. In Ninth ACM Symposium on Operating Systems Principles, October 1983.
[Wegner 1987] Peter Wegner. Dimensions of object-based language design. In Proceedings of the ACM OOPSLA-87 Conference, pages 168–182, October 1987.
[Weiser et al. 1989] Mark Weiser, Alan Demers, and Carl Hauser. The Portable Common Runtime Approach to Interoperability. In Proceedings of the 12th ACM Symposium on Operating System Principles, pages 114–122, 1989.
[Wendorf et al. 1989] James Wendorf, Roli Wendorf, and Hideyuki Tokuda. Scheduling Operating System Processing on Small-Scale Multiprocessors. In Proceedings of the 22nd Hawaii International Conference on System Sciences, pages 904–913, 1989.
[Yokote 1992] Yasuhiko Yokote. The Apertos Reflective Operating System: The Concept and its Implementation. Technical Report SCSL-TR-92-014, Sony Computer Science Laboratory, June 1992.
[Zacew et al. 1993] Roman Zacew, Paul Roy, David Black, Chris Peak, Paulo Guedes, Bradford Kemp, John LoVerso, Michael Leibensperger, Michael Barnett, Faramarz Rabbi, and Durriya Netterwala. The OSF/1 Unix for Massively Parallel Multicomputers. In Proceedings of the Winter 1993 USENIX, pages 449–468, January 1993.
[Zahorjan and McCann 1990] John Zahorjan and Cathy McCann. Processor Scheduling in Shared Memory Multiprocessor Systems. In Proceedings of the 1990 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 214–225, 1990.
[Zahorjan et al. 1988] John Zahorjan, Edward Lazowska, and Derek Eager. Spinning versus Blocking in Parallel Systems with Uncertainty. Technical Report 88-03-01, University of Washington, Seattle, March 1988.